Abstract. Semi-supervised clustering aims to introduce prior knowledge into
the decision process of a clustering algorithm. In this paper, we propose a
novel semi-supervised clustering algorithm based on the
information-maximization principle. The proposed method extends a previous
unsupervised information-maximization clustering algorithm based on
squared-loss mutual information so that it effectively incorporates
must-links and cannot-links. The proposed method is computationally
efficient because the clustering solution can be obtained analytically via
eigendecomposition. Furthermore, the proposed method allows systematic
optimization of tuning parameters such as the kernel width, given the degree
of belief in the must-links and cannot-links. The usefulness of the proposed
method is demonstrated through experiments.

Keywords: Clustering, Information Maximization, Squared-Loss Mutual
Information, Semi-supervised.

1 Introduction

The objective of clustering is to classify unlabeled data into disjoint groups
based on their similarity, and clustering has been extensively studied in
statistics and machine learning. K-means [12] is a classic algorithm that
clusters data so that the sum of within-cluster scatters is minimized.
However, its usefulness is rather limited in practice because k-means only
produces linearly separated clusters. Kernel k-means [5] overcomes this
limitation by performing k-means in a feature space induced by a reproducing
kernel function [15]. Spectral clustering [17, 13] first unfolds non-linear
data manifolds based on sample-sample similarity by a spectral embedding
method, and then performs k-means in the embedded space.

These non-linear clustering techniques are capable of handling highly complex
real-world data. However, they lack objective model selection strategies,
i.e., tuning parameters included in kernel functions or similarity measures
need to be manually determined in an unsupervised manner.
Information-maximization clustering can address the issue of model selection
[1, 7, 18]: it learns a probabilistic classifier so that some information
measure between feature vectors and cluster assignments is maximized in an
unsupervised manner. In the information-maximization approach, tuning
parameters included in kernel functions or similarity measures can be
systematically determined based on the information-maximization principle.
Among the information-maximization clustering methods, the algorithm based on
squared-loss mutual information (SMI) was demonstrated to be promising [18],
because it gives the clustering solution analytically via eigendecomposition.

In practical situations, additional side information regarding clustering
solutions is often provided, typically in the form of must-links and
cannot-links: a set of sample pairs that should belong to the same cluster and
a set of sample pairs that should belong to different clusters, respectively.
Such semi-supervised clustering (also known as clustering with side
information) has been shown to be useful in practice [22, 6, 23]. Spectral
learning [9] is a semi-supervised extension of spectral clustering that
enhances the similarity with side information so that sample pairs tied with
must-links have higher similarity and sample pairs tied with cannot-links have
lower similarity. On the other hand, constrained spectral clustering [24]
incorporates the must-links and cannot-links as constraints in the
optimization problem.

However, in the same way as unsupervised clustering, the above
semi-supervised clustering methods lack objective model selection strategies,
and thus tuning parameters included in similarity measures need to be
determined manually. In this paper, we extend the unsupervised SMI-based
clustering method to the semi-supervised clustering scenario. The proposed
method, called semi-supervised SMI-based clustering (3SMIC), gives the
clustering solution analytically via eigendecomposition with a systematic
model selection strategy. Through experiments on real-world datasets, we
demonstrate the usefulness of the proposed 3SMIC algorithm.

2 Information-Maximization Clustering with 
Squared-Loss Mutual Information 

In this section, we formulate the problem of information-maximization clustering 
and review an existing unsupervised clustering method based on squared-loss 
mutual information. 



2.1 Information-Maximization Clustering 

The goal of unsupervised clustering is to assign class labels to data
instances so that similar instances share the same label and dissimilar
instances have different labels. Let $\{x_i \mid x_i \in \mathbb{R}^d\}_{i=1}^n$
be feature vectors of data instances, which are drawn independently from a
probability distribution with density $p^*(x)$. Let
$\{y_i \mid y_i \in \{1, \ldots, c\}\}_{i=1}^n$ be class labels that we want to
obtain, where $c$ denotes the number of classes and we assume $c$ to be known
throughout the paper.

The information-maximization approach tries to learn the class-posterior
probability $p^*(y|x)$ in an unsupervised manner so that some "information"
measure between feature $x$ and label $y$ is maximized. Mutual information
(MI) [16] is a typical information measure for this purpose [1, 7]:

$$\mathrm{MI} := \int \sum_{y=1}^{c} p^*(x, y) \log \frac{p^*(x, y)}{p^*(x)\, p^*(y)} \,\mathrm{d}x. \tag{1}$$



An advantage of the information-maximization formulation is that tuning
parameters included in clustering algorithms, such as the Gaussian width and
the regularization parameter, can be objectively optimized based on the same
information-maximization principle. However, MI is known to be sensitive to
outliers [3], due to the log function that is strongly non-linear.
Furthermore, unsupervised learning of the class-posterior probability
$p^*(y|x)$ under MI is highly non-convex, and finding a good local optimum is
not straightforward in practice [7].

To cope with this problem, an alternative information measure called
squared-loss MI (SMI) has been introduced [21]:

$$\mathrm{SMI} := \frac{1}{2} \int \sum_{y=1}^{c} p^*(x)\, p^*(y) \left( \frac{p^*(x, y)}{p^*(x)\, p^*(y)} - 1 \right)^{2} \mathrm{d}x. \tag{2}$$



Ordinary MI is the Kullback-Leibler (KL) divergence [10] from $p^*(x, y)$ to
$p^*(x)p^*(y)$, while SMI is the Pearson (PE) divergence [14]. Both KL and PE
divergences belong to the class of the Ali-Silvey-Csiszár divergences [2, 4],
which are also known as the f-divergences. Thus, MI and SMI share many common
properties; for example, they are non-negative and equal to zero if and only
if feature vector $x$ and label $y$ are statistically independent.
Information-maximization clustering based on SMI was shown to be
computationally advantageous [18]. Below, we review the SMI-based clustering
(SMIC) algorithm.
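As a small numerical illustration of the two measures, the following sketch evaluates MI and SMI for a discrete toy joint distribution. The paper treats continuous $x$; the discrete tables and the function name `mi_and_smi` are assumptions made only for this example. Both quantities vanish for an independent table and are positive for a dependent one.

```python
import math

def mi_and_smi(joint):
    """Return (MI, SMI) for a discrete joint table joint[x][y] whose entries sum to 1."""
    px = [sum(row) for row in joint]            # marginal over y
    py = [sum(col) for col in zip(*joint)]      # marginal over x
    mi, smi = 0.0, 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            prod = px[i] * py[j]
            if prod == 0.0:
                continue                        # cell has zero mass under both measures
            ratio = pxy / prod                  # density ratio p(x,y) / (p(x) p(y))
            if pxy > 0.0:
                mi += pxy * math.log(ratio)     # KL-type integrand
            smi += 0.5 * prod * (ratio - 1.0) ** 2   # Pearson-type integrand
    return mi, smi

independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.45, 0.05], [0.05, 0.45]]
print(mi_and_smi(independent))
print(mi_and_smi(dependent))
```

The squared loss replaces the logarithm of the ratio by a quadratic in the ratio, which is what later makes the clustering objective a quadratic form in the model parameters.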

2.2 SMI-Based Clustering 

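In unsupervised clustering, it is not straightforward to approximate SMI (2)
because labeled samples are not available. To cope with this problem, let us
expand the squared term in Eq.(2). Then SMI can be expressed as

$$\begin{aligned}
\mathrm{SMI} &= \frac{1}{2} \int \sum_{y=1}^{c} p^*(x)\, p^*(y) \left( \frac{p^*(x, y)}{p^*(x)\, p^*(y)} \right)^{2} \mathrm{d}x
 - \int \sum_{y=1}^{c} p^*(x)\, p^*(y)\, \frac{p^*(x, y)}{p^*(x)\, p^*(y)} \,\mathrm{d}x + \frac{1}{2} \\
&= \frac{1}{2} \int \sum_{y=1}^{c} p^*(y|x)\, p^*(x)\, \frac{p^*(y|x)}{p^*(y)} \,\mathrm{d}x - \frac{1}{2}.
\end{aligned} \tag{3}$$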



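Suppose that the class-prior probability $p^*(y)$ is uniform, i.e.,

$$p^*(y) = \frac{1}{c} \quad \text{for } y = 1, \ldots, c.$$

Then we can express Eq.(3) as

$$\mathrm{SMI} = \frac{c}{2} \int \sum_{y=1}^{c} \left( p^*(y|x) \right)^{2} p^*(x) \,\mathrm{d}x - \frac{1}{2}. \tag{4}$$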

Let us approximate the class-posterior probability $p^*(y|x)$ by the following
kernel model:

$$p(y|x; \alpha) := \sum_{i=1}^{n} \alpha_{y,i} K(x, x_i), \tag{5}$$

where $\alpha = (\alpha_{1,1}, \ldots, \alpha_{c,n})^\top \in \mathbb{R}^{cn}$ is
the parameter vector, $^\top$ denotes the transpose, and $K(x, x')$ denotes a
kernel function. Let $K$ be the kernel matrix whose $(i,j)$ element is given by
$K(x_i, x_j)$ and let
$\alpha_y = (\alpha_{y,1}, \ldots, \alpha_{y,n})^\top \in \mathbb{R}^{n}$.
Approximating the expectation over $p^*(x)$ in Eq.(4) with the empirical
average of samples $\{x_i\}_{i=1}^n$ and replacing the class-posterior
probability $p^*(y|x)$ with the kernel model $p(y|x; \alpha)$, we have the
following SMI approximator:

$$\widehat{\mathrm{SMI}} := \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K^{2} \alpha_y - \frac{1}{2}. \tag{6}$$

Under orthonormality of $\{\alpha_y\}_{y=1}^{c}$, a global maximizer is given
by the normalized eigenvectors $\phi_1, \ldots, \phi_c$ associated with the
eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$ of $K$. Because the sign
of eigenvector $\phi_y$ is arbitrary, we set the sign as

$$\tilde{\phi}_y = \phi_y \times \mathrm{sign}(\phi_y^\top 1_n),$$

where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar and $1_n$ denotes the
$n$-dimensional vector with all ones. On the other hand, since

$$p^*(y) = \int p^*(y|x)\, p^*(x) \,\mathrm{d}x \approx \frac{1}{n} \sum_{i=1}^{n} p(y|x_i; \alpha) = \frac{1}{n}\, \alpha_y^\top K 1_n,$$

and the class-prior probability was set to be uniform, we have the following
normalization condition:

$$\frac{1}{n}\, \alpha_y^\top K 1_n = \frac{1}{c}.$$

Furthermore, negative outputs are rounded up to zero to ensure that outputs
are non-negative.

Taking these post-processing issues into account, cluster assignment $y_i$ for
$x_i$ is determined as the maximizer of the approximation of $p(y|x_i)$:

$$y_i = \mathop{\mathrm{argmax}}_{y} \frac{n\, [\max(0_n, K\tilde{\phi}_y)]_i}{c\, \max(0_n, K\tilde{\phi}_y)^\top 1_n}
      = \mathop{\mathrm{argmax}}_{y} \frac{[\max(0_n, \tilde{\phi}_y)]_i}{\max(0_n, \tilde{\phi}_y)^\top 1_n},$$

where $0_n$ denotes the $n$-dimensional vector with all zeros, the max
operation for vectors is applied in the element-wise manner, and $[\cdot]_i$
denotes the $i$-th element of a vector. Note that
$K\tilde{\phi}_y = \lambda_y \tilde{\phi}_y$ is used in the above derivation;
positive factors that do not depend on $y$ do not affect the maximizer.

For out-of-sample prediction, cluster assignment $y'$ for a new sample $x'$
may be obtained as

$$y' := \mathop{\mathrm{argmax}}_{y} \frac{\max\left(0, \sum_{i=1}^{n} K(x', x_i)\, [\tilde{\phi}_y]_i\right)}{\lambda_y \max(0_n, \tilde{\phi}_y)^\top 1_n}. \tag{7}$$

This clustering algorithm is called the SMI-based clustering (SMIC).
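The steps above (eigendecomposition, sign correction, rounding up of negative outputs, normalization, and the argmax) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `smic` and `smic_predict` are hypothetical, `K` is assumed to be a symmetric positive semi-definite kernel matrix whose `c` leading eigenvalues are positive, and cluster labels are 0-based.

```python
import numpy as np

def smic(K, c):
    """Sketch of SMIC for an (n, n) symmetric PSD kernel matrix K and c clusters."""
    eigvals, eigvecs = np.linalg.eigh(K)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:c]          # indices of the c leading eigenvalues
    lam = eigvals[order]
    phi = eigvecs[:, order]                        # column y is phi_y
    signs = np.sign(phi.sum(axis=0))               # sign(phi_y^T 1_n)
    signs[signs == 0] = 1.0                        # assumption: keep the vector if the sum is zero
    phi = phi * signs                              # phi-tilde
    pos = np.maximum(phi, 0.0)                     # round negative outputs up to zero
    scores = pos / pos.sum(axis=0)                 # normalize each column by max(0, phi)^T 1_n
    labels = np.argmax(scores, axis=1)
    return labels, phi, lam

def smic_predict(k_new, phi, lam):
    """Out-of-sample rule of Eq.(7); k_new[i] holds K(x', x_i)."""
    pos_sum = np.maximum(phi, 0.0).sum(axis=0)
    scores = np.maximum(k_new @ phi, 0.0) / (lam * pos_sum)
    return int(np.argmax(scores))

# Toy kernel with two blocks of unequal size (4 and 3 samples).
K = np.zeros((7, 7))
K[:4, :4] = 1.0
K[4:, 4:] = 1.0
labels, phi, lam = smic(K, c=2)
print(labels)
```

No iterative optimization is involved: the only numerical work is one symmetric eigendecomposition, which is the computational advantage stressed in the text.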

SMIC may include a tuning parameter, say $\theta$, in the kernel function, and
the clustering results of SMIC depend on the choice of $\theta$. A notable
advantage of information-maximization clustering is that such a tuning
parameter can be systematically optimized by the same
information-maximization principle. More specifically, cluster assignments
$\{y_i^{\theta}\}_{i=1}^{n}$ are first obtained for each possible $\theta$.
Then the quality of clustering is measured by the SMI value estimated from
paired samples $\{(x_i, y_i^{\theta})\}_{i=1}^{n}$. For this purpose, the
method of least-squares mutual information (LSMI) [21] is useful because LSMI
was theoretically proved to be the optimal non-parametric SMI approximator
[20]; see Appendix A for the details of LSMI. Thus, we compute LSMI as a
function of $\theta$, and the tuning parameter value that maximizes LSMI is
selected as the most suitable one:

$$\max_{\theta} \mathrm{LSMI}(\theta).$$



3 Semi-Supervised SMIC 

In this section, we extend SMIC to a semi-supervised clustering scenario where
a set of must-links and a set of cannot-links are provided. A must-link
$(i, j)$ means that $x_i$ and $x_j$ are encouraged to belong to the same
cluster, while a cannot-link $(i, j)$ means that $x_i$ and $x_j$ are
encouraged to belong to different clusters. Let $M$ be the must-link matrix
with $M_{i,j} = 1$ if a must-link between $x_i$ and $x_j$ is given and
$M_{i,j} = 0$ otherwise. In the same way, we define the cannot-link matrix
$C$. We assume that $M_{i,i} = 1$ for all $i = 1, \ldots, n$, and
$C_{i,i} = 0$ for all $i = 1, \ldots, n$. Below, we explain how must-link
constraints and cannot-link constraints are incorporated into the SMIC
formulation.



3.1 Incorporating Must-Links in SMIC 

When there exists a must-link between $x_i$ and $x_j$, we want them to share
the same class label. Let

$$p_i^* := \left( p^*(y = 1 | x_i), \ldots, p^*(y = c | x_i) \right)^\top$$

be the soft-response vector for $x_i$. Then the inner product
$\langle p_i^*, p_j^* \rangle$ is maximized if and only if $x_i$ and $x_j$
belong to the same cluster with perfect confidence, i.e., $p_i^*$ and $p_j^*$
are the same vector that has 1 in one element and 0 otherwise. Thus, the
must-link information may be utilized by increasing
$\langle p_i^*, p_j^* \rangle$ if $M_{i,j} = 1$. We implement this idea as

$$\widehat{\mathrm{SMI}} + \gamma \frac{c}{n} \sum_{i,j=1}^{n} M_{i,j} \sum_{y=1}^{c} p(y|x_i; \alpha)\, p(y|x_j; \alpha),$$

where $\gamma \ge 0$ determines how strongly we encourage the must-links to be
satisfied.
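Let us further utilize the following fact: if $x_i$ and $x_j$ belong to the
same class and $x_j$ and $x_k$ belong to the same class, then $x_i$ and $x_k$
also belong to the same class (i.e., a friend's friend is a friend). Letting
$M'_{i,j} := \sum_{k=1}^{n} M_{i,k} M_{k,j}$, we can incorporate this in SMIC
as

$$\begin{aligned}
&\widehat{\mathrm{SMI}}
 + \gamma \frac{c}{n} \sum_{i,j=1}^{n} M_{i,j} \sum_{y=1}^{c} p(y|x_i; \alpha)\, p(y|x_j; \alpha)
 + \gamma' \frac{c}{2n} \sum_{i,j=1}^{n} M'_{i,j} \sum_{y=1}^{c} p(y|x_i; \alpha)\, p(y|x_j; \alpha) \\
&\quad = \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K^{2} \alpha_y
 + \frac{c\gamma}{n} \sum_{y=1}^{c} \alpha_y^\top K M K \alpha_y
 + \frac{c\gamma'}{2n} \sum_{y=1}^{c} \alpha_y^\top K M^{2} K \alpha_y - \frac{1}{2} \\
&\quad = \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K \left( I + 2\gamma M + \gamma' M^{2} \right) K \alpha_y - \frac{1}{2}.
\end{aligned}$$

If we set $\gamma' = \gamma^{2}$, we have a simpler form:

$$\frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K \left( I + \gamma M \right)^{2} K \alpha_y - \frac{1}{2},$$

which will be used later.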
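The equality between the element-wise must-link objective and its matrix form with $\gamma' = \gamma^2$ can be checked numerically. The sketch below uses random toy data and a plain Gaussian kernel; the function names, the toy sizes, and the kernel are assumptions made only for this check, and $K$ and $M$ are assumed symmetric.

```python
import numpy as np

def must_link_objective(K, A, M, gamma):
    """Element-wise form: SMI-hat plus both must-link terms, with gamma' = gamma**2."""
    n, c = A.shape
    P = K @ A                                   # P[i, y] = p(y | x_i; alpha)
    G = P @ P.T                                 # G[i, j] = sum_y p(y|x_i) p(y|x_j)
    smi_hat = c / (2 * n) * np.sum(P * P) - 0.5
    direct = gamma * c / n * np.sum(M * G)                       # links M
    transitive = gamma ** 2 * c / (2 * n) * np.sum((M @ M) * G)  # friend's friend, M' = M^2
    return smi_hat + direct + transitive

def must_link_objective_matrix(K, A, M, gamma):
    """Matrix form: (c / 2n) * sum_y alpha_y^T K (I + gamma M)^2 K alpha_y - 1/2."""
    n, c = A.shape
    B = np.eye(n) + gamma * M
    return c / (2 * n) * np.trace(A.T @ K @ B @ B @ K @ A) - 0.5

rng = np.random.default_rng(0)
n, c, gamma = 6, 2, 0.5
X = rng.normal(size=(n, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 2.0)                            # toy Gaussian kernel (assumption)
A = rng.normal(size=(n, c))                      # column y plays the role of alpha_y
M = np.eye(n)
M[0, 1] = M[1, 0] = 1.0                          # must-link (0, 1)
M[1, 2] = M[2, 1] = 1.0                          # must-link (1, 2)
print(must_link_objective(K, A, M, gamma), must_link_objective_matrix(K, A, M, gamma))
print((M @ M)[0, 2])                             # induced link between samples 0 and 2
```

The entry of $M^2$ for the pair (0, 2) is non-zero even though no link was given for that pair, which is the "friend's friend" effect.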

3.2 Incorporating Cannot-Links in SMIC 

We may incorporate cannot-links in SMIC in the opposite way to must-links, by
decreasing the inner product $\langle p_i^*, p_j^* \rangle$ to zero. This may
be implemented as

$$\widehat{\mathrm{SMI}} - \eta \frac{c}{n} \sum_{i,j=1}^{n} C_{i,j} \sum_{y=1}^{c} p(y|x_i; \alpha)\, p(y|x_j; \alpha),$$

where $\eta \ge 0$ determines how strongly we encourage the cannot-links to be
satisfied.

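In binary clustering problems where $c = 2$, if $x_i$ and $x_j$ belong to
different classes and $x_j$ and $x_k$ belong to different classes, $x_i$ and
$x_k$ actually belong to the same class (i.e., an enemy's enemy is a friend).
Let $C'_{i,j} := \sum_{k=1}^{n} C_{i,k} C_{k,j}$; we take this into account
as must-links in the following way:

$$\begin{aligned}
&\widehat{\mathrm{SMI}}
 - \eta \frac{c}{n} \sum_{i,j=1}^{n} C_{i,j} \sum_{y=1}^{c} p(y|x_i; \alpha)\, p(y|x_j; \alpha)
 + \eta' \frac{c}{2n} \sum_{i,j=1}^{n} C'_{i,j} \sum_{y=1}^{c} p(y|x_i; \alpha)\, p(y|x_j; \alpha) \\
&\quad = \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K \left( I - 2\eta C + \eta' C^{2} \right) K \alpha_y - \frac{1}{2}.
\end{aligned}$$

If we set $\eta' = \eta^{2}$, we have

$$\frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K \left( I - \eta C \right)^{2} K \alpha_y - \frac{1}{2},$$

which will be used later.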

3.3 Kernel Matrix Modification 

Another approach to incorporating must-links and cannot-links is to modify the
kernel matrix $K$. More specifically, $K_{i,j}$ is increased if there exists a
must-link between $x_i$ and $x_j$, and $K_{i,j}$ is decreased if there exists
a cannot-link between $x_i$ and $x_j$. In this paper, we assume
$K_{i,j} \in [0, 1]$, and set $K_{i,j} = 1$ if there exists a must-link
between $x_i$ and $x_j$ and $K_{i,j} = 0$ if there exists a cannot-link
between $x_i$ and $x_j$. Let us denote the modified kernel matrix by $K'$:

$$K' \longleftarrow K.$$

This modification idea has been employed in spectral clustering [9] and
demonstrated to be promising.
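A minimal sketch of this kernel modification follows. The function name `modify_kernel` is hypothetical, and the rule that a cannot-link overrides a must-link when both are given for the same pair is an assumption of this sketch, since the text does not discuss that case.

```python
import numpy as np

def modify_kernel(K, M, C):
    """Return K' with entries set to 1 on must-links and 0 on cannot-links.

    K is assumed to have entries in [0, 1]; M and C are symmetric 0/1 matrices.
    """
    K_mod = K.copy()                 # leave the original kernel matrix untouched
    K_mod[M == 1] = 1.0              # must-linked pairs get the maximum similarity
    K_mod[C == 1] = 0.0              # cannot-linked pairs get the minimum similarity
    return K_mod

K = np.array([[1.0, 0.2, 0.7],
              [0.2, 1.0, 0.4],
              [0.7, 0.4, 1.0]])
M = np.eye(3)
M[0, 1] = M[1, 0] = 1.0              # must-link (0, 1)
C = np.zeros((3, 3))
C[0, 2] = C[2, 0] = 1.0              # cannot-link (0, 2)
print(modify_kernel(K, M, C))
```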

3.4 Semi-Supervised SMIC 

Finally, we combine the above three ideas as

$$\max_{\alpha_1, \ldots, \alpha_c} \sum_{y=1}^{c} \alpha_y^\top U \alpha_y,$$

where

$$U := K' \left( 2I + 2\gamma M + \gamma^{2} M^{2} - 2\eta C + \eta^{2} C^{2} \right) K'. \tag{8}$$

When $c > 2$, we fix $\eta$ at zero.

This is the learning criterion of semi-supervised SMIC (3SMIC), whose global
maximizer can be analytically obtained under orthonormality of
$\{\alpha_y\}_{y=1}^{c}$ by the leading eigenvectors of $U$. Then the same
post-processing as the original SMIC is applied and cluster assignments are
obtained. Out-of-sample prediction is also possible in the same way as the
original SMIC.
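The construction of $U$ in Eq.(8) and the subsequent eigendecomposition can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function name `three_smic` is hypothetical, and the post-processing is applied here to the kernel-model outputs $K'\phi_y$, which is one reading of "the same post-processing as the original SMIC" that the text does not spell out.

```python
import numpy as np

def three_smic(K_mod, M, C, c, gamma, eta):
    """Sketch of 3SMIC given the modified kernel K' and symmetric link matrices M, C."""
    n = K_mod.shape[0]
    I = np.eye(n)
    if c > 2:
        eta = 0.0                                  # eta is fixed at zero for multi-class problems
    A = I + gamma * M
    B = I - eta * C
    U = K_mod @ (A @ A + B @ B) @ K_mod            # equals Eq.(8) after expanding the squares
    U = 0.5 * (U + U.T)                            # remove rounding asymmetry before eigh
    eigvals, eigvecs = np.linalg.eigh(U)
    phi = eigvecs[:, np.argsort(eigvals)[::-1][:c]]    # c leading eigenvectors
    signs = np.sign(phi.sum(axis=0))
    signs[signs == 0] = 1.0
    outputs = np.maximum(K_mod @ (phi * signs), 0.0)   # assumption: use model outputs K' phi
    labels = np.argmax(outputs / outputs.sum(axis=0), axis=1)
    return labels, U

# Toy example: two blocks (4 and 3 samples), one must-link and one cannot-link.
K = np.zeros((7, 7))
K[:4, :4] = 1.0
K[4:, 4:] = 1.0
M = np.eye(7)
M[0, 1] = M[1, 0] = 1.0
C = np.zeros((7, 7))
C[0, 4] = C[4, 0] = 1.0
labels, U = three_smic(K, M, C, c=2, gamma=1.0, eta=0.5)
print(labels)
```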
3.5 Tuning Parameter Optimization in 3SMIC

In the original SMIC, an SMI approximator called LSMI is used for tuning
parameter optimization (see Appendix A). However, this is not suitable in
semi-supervised scenarios because the 3SMIC solution is biased to satisfy
must-links and cannot-links. Here, we propose using

$$\max_{\theta} \; \mathrm{LSMI}(\theta) + \mathrm{Penalty}(\theta),$$

where $\theta$ indicates tuning parameters in 3SMIC; in the experiments,
$\gamma$, $\eta$, and the parameter $t$ included in the kernel function
$K(x, x')$ are optimized. "Penalty" is the penalty for violating must-links
and cannot-links, which is the only tuning factor in the proposed algorithm.
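One way to realize this criterion is sketched below, assuming the penalty is the negative number of violated links and that both terms are normalized by their maxima over the candidates, which is the form used in the experiments of Section 4. The function names and the candidate lists are hypothetical.

```python
def count_violations(labels, must_links, cannot_links):
    """Number of violated links for a clustering given as a list of labels."""
    violated = sum(1 for i, j in must_links if labels[i] != labels[j])
    violated += sum(1 for i, j in cannot_links if labels[i] == labels[j])
    return violated

def select_candidate(lsmi_values, violation_counts):
    """Index of the candidate maximizing normalized LSMI minus normalized violations."""
    max_lsmi = max(lsmi_values)
    max_viol = max(violation_counts)
    best_index, best_score = None, None
    for index, (value, violated) in enumerate(zip(lsmi_values, violation_counts)):
        fit = value / max_lsmi if max_lsmi > 0 else 0.0          # guard against division by zero
        penalty = violated / max_viol if max_viol > 0 else 0.0
        score = fit - penalty
        if best_score is None or score > best_score:
            best_index, best_score = index, score
    return best_index

# Three hypothetical candidates theta_0, theta_1, theta_2.
print(select_candidate([0.2, 0.5, 0.4], [0, 4, 1]))
```

In this toy run the candidate with the largest LSMI also violates the most links, so a candidate with slightly lower LSMI and few violations is preferred.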



4 Experiments 

In this section, we experimentally evaluate the performance of the proposed
3SMIC method in comparison with popular semi-supervised clustering methods:
Spectral Learning (SL) [9] and Constrained Spectral Clustering (CSC) [24].
Both methods first perform semi-supervised spectral embedding and then
k-means to obtain clustering results. However, we observed that the post
k-means step is often unreliable, so we use simple thresholding [17] in the
case of binary clustering for CSC.

In all experiments, we use a sparse version of the local-scaling kernel [25]
as the similarity measure:

$$K(x_i, x_j) = \begin{cases}
\exp\left( -\dfrac{\|x_i - x_j\|^{2}}{2 \sigma_i \sigma_j} \right) & \text{if } x_i \in \mathcal{N}_t(x_j) \text{ or } x_j \in \mathcal{N}_t(x_i), \\
0 & \text{otherwise},
\end{cases}$$

where $\mathcal{N}_t(x)$ denotes the set of $t$ nearest neighbors for $x$
($t$ is the kernel parameter), $\sigma_i$ is a local scaling factor defined as
$\sigma_i = \|x_i - x_i^{(t)}\|$, and $x_i^{(t)}$ is the $t$-th nearest
neighbor of $x_i$. For SL and CSC, we test $t = 1, 4, 7, 10$ (note that there
is no systematic way to choose the value of $t$), except that $t = 1$ is
omitted for the spam dataset, where it caused numerical problems in the
eigensolver when testing SL. On the other hand, in 3SMIC, we choose the value
of $t$ from $\{1, \ldots, 10\}$ based on the following criterion:

$$\frac{\mathrm{LSMI}(\theta)}{\max_{\theta} \mathrm{LSMI}(\theta)} - \frac{n_v}{\max_{\theta}(n_v)}, \tag{9}$$

where $n_v$ is the number of violated links. Here, both the LSMI value and the
penalty are normalized so that they fall into the range $[0, 1]$. The $\gamma$
and $\eta$ parameters in 3SMIC are also chosen based on Eq.(9).
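The sparse local-scaling kernel can be computed as in the following sketch. It is a dense O(n^2) illustration with a hypothetical function name; it assumes that all samples are distinct (so every local scale is positive), that the exponent carries the factor 2 in the denominator, and that the diagonal is set to one, none of which is stated explicitly in the text.

```python
import numpy as np

def sparse_local_scaling_kernel(X, t):
    """Sparse local-scaling kernel for an (n, d) array X and neighborhood size t < n."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    dist = np.sqrt(sq)
    order = np.argsort(dist, axis=1)            # column 0 is the sample itself (distinct samples)
    nn = order[:, 1:t + 1]                      # indices of the t nearest neighbors
    sigma = dist[np.arange(n), order[:, t]]     # distance to the t-th nearest neighbor
    neighbor = np.zeros((n, n), dtype=bool)
    neighbor[np.repeat(np.arange(n), t), nn.ravel()] = True
    mask = neighbor | neighbor.T                # x_i in N_t(x_j) or x_j in N_t(x_i)
    np.fill_diagonal(mask, True)                # assumption: K(x_i, x_i) = 1
    dense = np.exp(-sq / (2.0 * sigma[:, None] * sigma[None, :]))
    return np.where(mask, dense, 0.0)

X = np.array([[0.0], [1.0], [3.0], [7.0]])
print(sparse_local_scaling_kernel(X, t=1))
```

Because the kernel is zero outside the neighborhood graph, its entries lie in $[0, 1]$, which matches the assumption made when the kernel matrix is modified by the links.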
We use the following real-world datasets:

parkinson (d = 22, n = 195, and c = 2): The UCI dataset consisting of voice
recordings from patients suffering from Parkinson's disease and from healthy
individuals. From each voice recording, 22 features are extracted.

spam (d = 57, n = 4601, and c = 2): The UCI dataset consisting of e-mails
categorized as spam or non-spam. 48 word-frequency features and 9 other
frequency features, such as specific characters and capitalization, are
extracted.

sonar (d = 60, n = 208, and c = 2): The UCI dataset consisting of sonar
responses from a metal object or a rock. The features represent energy in
each frequency band.

digits500 (d = 256, n = 500, and c = 10): The USPS digits dataset consisting
of gray-scale images of handwritten digits from 0 to 9, each with 256
(16 x 16) pixels. We randomly sampled 50 images for each digit, and
normalized each pixel intensity in the image between -1 and 1.

digits5k (d = 256, n = 5000, and c = 10): The same USPS digits dataset but
with 500 images for each class.

faces100 (d = 4096, n = 100, and c = 10): The Olivetti Face dataset consisting
of gray-scale images of human faces, each with 4096 (64 x 64) pixels. We
randomly selected 10 persons, and used 10 images for each person.

Must-links and cannot-links are generated from the true labels by randomly
sampling a pair of points and setting the corresponding entry of the M or C
matrix to 1, depending on whether the labels of the chosen pair agree. CSC is
excluded from digits5k and spam because it needs to solve the complete
eigenvalue problem and its computational cost was too high on these large
datasets.
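The link-generation procedure can be sketched as follows. The function name `sample_links`, the seed handling, and the choice to count only distinct pairs are assumptions of this sketch; the text specifies only that pairs are sampled at random and assigned to M or C according to the true labels.

```python
import random

def sample_links(true_labels, n_links, seed=0):
    """Sample n_links distinct pairs; same-label pairs become must-links, others cannot-links."""
    n = len(true_labels)
    if n_links > n * (n - 1) // 2:
        raise ValueError("more links requested than distinct pairs available")
    rng = random.Random(seed)
    must, cannot = set(), set()
    while len(must) + len(cannot) < n_links:
        i, j = rng.sample(range(n), 2)          # two different sample indices
        pair = (min(i, j), max(i, j))           # store each pair once, in sorted order
        if true_labels[i] == true_labels[j]:
            must.add(pair)
        else:
            cannot.add(pair)
    return sorted(must), sorted(cannot)

must, cannot = sample_links([0, 0, 1, 1, 1], n_links=6)
print(must, cannot)
```

The matrices M and C are then obtained by setting the symmetric entries of each returned pair to 1, with the diagonal of M set to 1 as assumed in Section 3.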

We evaluate the clustering performance by the Adjusted Rand Index (ARI) [8]
between learned and true labels. Larger ARI values mean better clustering
performance, and an ARI value of zero means that the clustering result is
equivalent to random. We investigate the ARI score as a function of the number
of links used. Averages and standard deviations of ARI over 20 runs with
different random seeds are plotted in Figure 1.
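For reference, ARI can be computed from the contingency table of the two labelings with the standard pair-counting formula, as in this sketch. The function name is hypothetical, at least two samples are assumed, and returning 1.0 in the degenerate case where the denominator vanishes is a convention chosen here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same n >= 2 samples."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))                 # contingency table
    sum_cells = sum(comb(v, 2) for v in cells.values())      # pairs together in both labelings
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)                    # chance level of sum_cells
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                                # degenerate case (convention)
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition, labels renamed
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))   # partition that splits every cluster
```

ARI is invariant to renaming of cluster labels, which is why it suits clustering evaluation where label identities are arbitrary.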

We can separate the datasets into two groups. For digits500, digits5k, and
faces100, the baseline performances without links are reasonable; the
introduction of links significantly increases the performance, bringing it
from 0.5-0.8 to around 0.9-0.95.

For parkinson, spam, and sonar, where the baseline performances without links
are poor, the introduction of links quickly allows the clustering algorithms
to find better solutions. In particular, only 3% of links (relative to all
possible pairs) were sufficient for parkinson to achieve reasonable
performance and, surprisingly, only 0.1% for spam.

As shown in Figure 1, the performance of SL depends heavily on the choice of
t, but there is no systematic way to choose t for SL. It is important to
notice that 3SMIC with t chosen systematically based on Eq.(9) performs as
well as SL with t tuned optimally with hindsight. On the other hand, CSC
performs rather stably for different values of t, and it works particularly
well for binary problems with a small number of links. However, it performs
very poorly for multi-class problems; we observed that the post k-means step
is highly unreliable and poor local optimal solutions are often produced. For
the binary problems, simply performing thresholding [17] instead of using
k-means was found to be useful. However, there seem to be no simple
alternatives in multi-class cases. The performance of CSC drops in parkinson
and sonar when the number of links is increased, although such phenomena were
not observed in SL and 3SMIC.

Fig. 1. Experimental results.

Overall, the proposed 3SMIC method was shown to be a promising semi- 
supervised clustering method. 

5 Conclusions 

In this paper, we proposed a novel information-maximization clustering method
that can utilize side information provided as must-links and cannot-links. The
proposed method, named semi-supervised SMI-based clustering (3SMIC), allows
us to compute the clustering solution analytically. This is a strong advantage
over conventional approaches such as constrained spectral clustering (CSC),
which requires a post k-means step, because this post k-means step can be
unreliable and cause significant performance degradation in practice.
Furthermore, 3SMIC allows us to systematically determine tuning parameters
such as the kernel width based on the information-maximization principle,
given our reliance on the provided side information. Through experiments, we
demonstrated that automatically-tuned 3SMIC performs as well as
optimally-tuned spectral learning (SL) with hindsight.

The focus of our method in this paper was to inherit the analytical treat- 
ment of the original unsupervised SMIC in semi-supervised learning scenarios. 
Although this analytical treatment was demonstrated to be highly useful in 
experiments, our future work will explore more efficient use of must-links and 
cannot-links. 

In the previous work [11], negative eigenvalues were found to contain useful
information. Because must-link and cannot-link matrices can possess negative
eigenvalues, it is interesting to investigate the role and effect of negative
eigenvalues in the context of information-maximization clustering.
A Least-Squares Mutual Information

The solution of SMIC depends on the choice of the kernel parameter included
in the kernel function $K(x, x')$. Since SMIC was developed in the framework
of SMI maximization, it would be natural to determine the kernel parameter
so as to maximize SMI. A direct approach is to use the SMI estimator
$\widehat{\mathrm{SMI}}$ given by Eq.(6) also for kernel parameter choice.
However, this direct approach is not favorable because
$\widehat{\mathrm{SMI}}$ is an unsupervised SMI estimator (i.e., SMI is
estimated only from unlabeled samples $\{x_i\}_{i=1}^{n}$). On the other hand,
in the model selection stage, we have already obtained labeled samples
$\{(x_i, y_i)\}_{i=1}^{n}$, and thus supervised estimation of SMI is possible.
For supervised SMI estimation, a non-parametric SMI estimator called
least-squares mutual information (LSMI) [21] was proved to achieve the optimal
convergence rate to the true SMI. Here we briefly review LSMI.

The key idea of LSMI is to learn the following density-ratio function [19],

$$r^*(x, y) := \frac{p^*(x, y)}{p^*(x)\, p^*(y)},$$

without going through probability density/mass estimation of $p^*(x, y)$,
$p^*(x)$, and $p^*(y)$. More specifically, let us employ the following
density-ratio model:

$$r(x, y; \omega) := \sum_{\ell : y_\ell = y} \omega_\ell\, L(x, x_\ell), \tag{10}$$

where $\omega = (\omega_1, \ldots, \omega_n)^\top$ and $L(x, x')$ is a kernel
function. In practice, we use the Gaussian kernel

$$L(x, x') = \exp\left( -\frac{\|x - x'\|^{2}}{2\kappa^{2}} \right),$$

where the Gaussian width $\kappa$ is the kernel parameter. To save the
computation cost, we limit the number of kernel bases to 500 with randomly
selected kernel centers.

The parameter $\omega$ in the above density-ratio model is learned so that the
following squared error is minimized:

$$\min_{\omega} \frac{1}{2} \int \sum_{y=1}^{c} \left( r(x, y; \omega) - r^*(x, y) \right)^{2} p^*(x)\, p^*(y) \,\mathrm{d}x. \tag{11}$$

Let $\omega^{(y)}$ be the parameter vector corresponding to the kernel bases
$\{L(x, x_\ell)\}_{\ell : y_\ell = y}$, i.e., $\omega^{(y)}$ is the sub-vector
of $\omega = (\omega_1, \ldots, \omega_n)^\top$ consisting of indices
$\{\ell \mid y_\ell = y\}$. Let $n_y$ be the number of samples in class $y$,
which is the same as the dimensionality of $\omega^{(y)}$. Then an empirical
and regularized version of the optimization problem (11) is given for each
$y$ as follows:

$$\min_{\omega^{(y)}} \left[ \frac{1}{2}\, \omega^{(y)\top} \widehat{H}^{(y)} \omega^{(y)} - \omega^{(y)\top} \widehat{h}^{(y)} + \frac{\delta}{2}\, \omega^{(y)\top} \omega^{(y)} \right], \tag{12}$$

where $\delta$ ($> 0$) is the regularization parameter. $\widehat{H}^{(y)}$ is
the $n_y \times n_y$ matrix and $\widehat{h}^{(y)}$ is the $n_y$-dimensional
vector defined as

$$\widehat{H}^{(y)}_{\ell, \ell'} := \frac{n_y}{n^{2}} \sum_{i=1}^{n} L(x_i, x_\ell^{(y)})\, L(x_i, x_{\ell'}^{(y)}),
\qquad
\widehat{h}^{(y)}_{\ell} := \frac{1}{n} \sum_{i : y_i = y} L(x_i, x_\ell^{(y)}),$$

where $x_\ell^{(y)}$ is the $\ell$-th sample in class $y$ (which corresponds
to $\omega_\ell^{(y)}$).

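A notable advantage of LSMI is that the solution $\widehat{\omega}^{(y)}$ can
be computed analytically as

$$\widehat{\omega}^{(y)} = \left( \widehat{H}^{(y)} + \delta I \right)^{-1} \widehat{h}^{(y)}.$$

Then a density-ratio estimator is obtained analytically as follows:

$$\widehat{r}(x, y) = \sum_{\ell=1}^{n_y} \widehat{\omega}_\ell^{(y)}\, L(x, x_\ell^{(y)}).$$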

The accuracy of the above least-squares density-ratio estimator depends on
the choice of the kernel parameter $\kappa$ included in $L(x, x')$ and the
regularization parameter $\delta$ in Eq.(12). These tuning parameter values
can be systematically optimized based on cross-validation as follows: First,
the samples $\mathcal{Z} = \{(x_i, y_i)\}_{i=1}^{n}$ are divided into $M$
disjoint subsets $\{\mathcal{Z}_m\}_{m=1}^{M}$ of approximately the same size
(we use $M = 5$ in the experiments). Then a density-ratio estimator
$\widehat{r}_m(x, y)$ is obtained using $\mathcal{Z} \setminus \mathcal{Z}_m$
(i.e., all samples without $\mathcal{Z}_m$), and its out-of-sample error
(which corresponds to Eq.(11) without irrelevant constant) for the hold-out
samples $\mathcal{Z}_m$ is computed as

$$\mathrm{CV}_m := \frac{1}{2 |\mathcal{Z}_m|^{2}} \sum_{x, y \in \mathcal{Z}_m} \widehat{r}_m(x, y)^{2} - \frac{1}{|\mathcal{Z}_m|} \sum_{(x, y) \in \mathcal{Z}_m} \widehat{r}_m(x, y),$$

where $\sum_{x, y \in \mathcal{Z}_m}$ denotes the summation over all
combinations of $x$ and $y$ in $\mathcal{Z}_m$ (and thus $|\mathcal{Z}_m|^{2}$
terms), while $\sum_{(x, y) \in \mathcal{Z}_m}$ denotes the summation over all
pairs $(x, y)$ in $\mathcal{Z}_m$ (and thus $|\mathcal{Z}_m|$ terms). This
procedure is repeated for $m = 1, \ldots, M$, and the average of the above
hold-out error over all $m$ is computed as

$$\mathrm{CV} := \frac{1}{M} \sum_{m=1}^{M} \mathrm{CV}_m.$$

Then the kernel parameter $\kappa$ and the regularization parameter $\delta$
that minimize the average hold-out error CV are chosen as the most suitable
ones.
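Finally, given that SMI (2) can be expressed as

$$\mathrm{SMI} = -\frac{1}{2} \int \sum_{y=1}^{c} r^*(x, y)^{2}\, p^*(x)\, p^*(y) \,\mathrm{d}x + \int \sum_{y=1}^{c} r^*(x, y)\, p^*(x, y) \,\mathrm{d}x - \frac{1}{2},$$

an SMI estimator based on the above density-ratio estimator, called
least-squares mutual information (LSMI), is given as follows:

$$\mathrm{LSMI} := -\frac{1}{2n^{2}} \sum_{i,j=1}^{n} \widehat{r}(x_i, y_j)^{2} + \frac{1}{n} \sum_{i=1}^{n} \widehat{r}(x_i, y_i) - \frac{1}{2},$$

where $\widehat{r}(x, y)$ is a density-ratio estimator obtained above.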
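The analytic LSMI computation can be sketched in NumPy as follows. This is a simplified illustration: the function name `lsmi` is hypothetical, all samples are used as kernel centers (instead of at most 500 random centers), and the Gaussian width `kappa` and the regularization parameter `delta` are passed in as fixed values rather than chosen by cross-validation.

```python
import numpy as np

def lsmi(X, y, kappa, delta):
    """LSMI estimate from samples X of shape (n, d) and integer labels y of shape (n,)."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    L = np.exp(-sq / (2.0 * kappa ** 2))           # L[i, l] = L(x_i, x_l)
    R = np.zeros((n, n))                           # R[i, j] = r_hat(x_i, y_j)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)           # samples (and centers) of this class
        n_y = idx.size
        Ly = L[:, idx]                             # kernel bases of class `label`
        H = (n_y / n ** 2) * (Ly.T @ Ly)           # H-hat for this class
        h = Ly[idx].sum(axis=0) / n                # h-hat for this class
        omega = np.linalg.solve(H + delta * np.eye(n_y), h)   # analytic solution
        R[:, idx] = (Ly @ omega)[:, None]          # r_hat(x_i, label) for every i
    return -np.sum(R ** 2) / (2.0 * n ** 2) + np.trace(R) / n - 0.5

# Two well-separated 1-D groups: labels aligned with the groups vs. alternating labels.
X = np.concatenate([0.1 * np.arange(10), 5.0 + 0.1 * np.arange(10)])[:, None]
aligned = np.array([0] * 10 + [1] * 10)
alternating = np.arange(20) % 2
print(lsmi(X, aligned, kappa=1.0, delta=0.1))
print(lsmi(X, alternating, kappa=1.0, delta=0.1))
```

Labels that follow the group structure yield a clearly larger LSMI value than labels unrelated to it, which is the property exploited for tuning parameter selection.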